Skip to content

feat: T7 default-on resourceContext + content search + log pipe primitives#755

Open
nadaverell wants to merge 35 commits into
mainfrom
test/t6-t89-combined
Open

feat: T7 default-on resourceContext + content search + log pipe primitives#755
nadaverell wants to merge 35 commits into
mainfrom
test/t6-t89-combined

Conversation

@nadaverell
Copy link
Copy Markdown
Contributor

@nadaverell nadaverell commented May 21, 2026

Summary

Bundle of bench-validated agent-surface improvements developed and tested on the combined branch (test/t6-t89-combined). Will be rebased / cherry-picked into smaller parts as pieces merge.

T7 — MCP get_resource default-on resourceContext

  • New internal/mcp/rc_rbac.go: request-scoped RefAccessChecker adapter
  • New internal/mcp/resource_context.go: MCP-side IssueSummary / AuditSummary / PolicyReport composers + topology lookup
  • tools.go: getResourceInput.Context, group-first dispatch (skip typed cache when group set), Build wrapping into {resource, resourceContext} envelope
  • Validated end-to-end against gke koalabackend nonprod

pkg/resourcecontext additions

  • ServiceBackends lookup interface (resolve Service selector → ready Pods)
  • PodSummary, WorkloadSummary, StatusSummary projections
  • Owner-ref handling + audit severity normalization
  • Tests across build.go and types.go (+398 LoC test coverage)

Search content-field matching (drives the B5 secret-blast-radius win — 4.5× faster than kubectl)

  • ContentField type + content-exact/substring scoring
  • MatchSnippet per hit (180-rune excerpt + which content field matched)
  • Extracts from ConfigMap/Secret data, workload env refs, CRD spec fields

Log pipe primitives (kubectl-arm parity)

  • FilterLogsByPattern in pkg/ai/context/logs.go: server-side kubectl logs | grep
  • get_pod_logs + get_workload_logs accept grep, since (kubectl --since), previous (kubectl -p for CrashLoopBackOff diagnosis)
  • parseLogsSince helper with full unit tests
  • Evidence: container_kill kubectl-arm transcript shows heavy time-bounded log queries; this closes the gap

Tool description tightening (descriptions remain the highest-leverage knob)

  • tools_audit.go: "compliant resources are not returned" — addresses B7 audit false-positive vector
  • Minor description fixes across tools_helm.go, tools_neighborhood.go, tools_rbac.go, tools_apply.go, tools_workloads.go

Bench context

Last N=1 run (May 20): same accuracy as kubectl on B-series (3/4 each), radar 3.3× faster, 18% cheaper. B5 win attributed to content-field search. B7 loss diagnosed: agent calls 'issues source="problem,condition,audit"' and conflates operational issues with security posture — issues-side description fix tracked separately.

Test plan

  • Build + unit tests pass locally
  • N=3 benchmark on B1 / B3 / B5 / B7 + target_port 2/3 + payment / cart / image + ≥1 log-heavy scenario (after merge)
  • B5 single-call content-search win reproduces at N=3
  • B7 decoy (api-gateway) no longer flagged after audit description hits agent context
  • target_port follow-up: inspect failed traces + live EndpointSlice objects before deciding data vs description fix
  • Hub passthrough verified (separate PR — T13)

Note

Medium Risk
Touches multiple AI-facing surfaces (MCP tools, /api/ai handlers, search, issues composition, and log retrieval) with new RBAC gates and new response shapes, so regressions could affect authorization, correctness of resource disambiguation by group, or performance on large clusters.

Overview
Adds richer AI/agent responses and fixes correctness/perf edge cases across issues, search, and resource retrieval. MCP get_resource now returns {resource, resourceContext} by default (opt-out via context=none), with request-scoped RBAC checks, topology-backed relationships, and per-resource issue/audit/policy rollups; MCP list_resources and search can also attach summaryContext (managedBy/health/issueCount) and now route group-qualified kinds straight to the dynamic cache to avoid core/CRD collisions.

Improves search and logs for agent workflows. Search now indexes selected object content (e.g. spec/status/data, ConfigMap data) and returns per-hit snippets, while explicitly excluding Secret values; it also defers summaryContext building until after sorting/limit truncation. get_pod_logs/get_workload_logs gain grep, since, and previous support with validation.

Issue composition becomes group-aware and more scalable. Issues now populate canonical API group for Problems/Audit/Events (with fallback mapping), add NoLimit to disable caps for summary-context indexing, and short-circuit generic CRD condition scans when Filters.Kinds narrows the request to a single kind. REST /api/ai/resources gains context=none, summaryContext attachment on list, group-first dynamic routing, and shared RBAC preflight helpers, with new tests pinning group disambiguation and authorization behavior.

Reviewed by Cursor Bugbot for commit 3d33d73. Bugbot is set up for automated code reviews on this repo. Configure here.

nadaverell added 30 commits May 18, 2026 22:37
…ET (T6)

Adds the basic-tier resourceContext builder and wires it into the
/api/ai/resources/{kind}/{ns}/{name} GET handler. The package layer is
pure-Go and depends only on pkg/topology — callers in internal/* pre-
compute IssueSummary and AuditSummary so this package doesn't reach
into internal/issues or internal/audit.

What ships:

- pkg/resourcecontext/build.go: Build(ctx, obj, opts) walks topology
  relationships + pod spec to populate ManagedBy (Argo tracking-id /
  Flux labels / owner ref), Exposes, SelectedBy (PDB + NetworkPolicy
  split), Uses (ConfigMaps/Secrets/PVCs/SA), RunsOn, ScaledBy, and
  PolicySummary. Every ContextRef passes through opts.AccessChecker;
  denials are dropped and recorded in Omitted with rbac_denied reason.

- pkg/resourcecontext/hints.go: SynthesizeHints renders 5–8 short,
  deterministic prose lines for AI consumers. EmitHints=false on UI
  callers leaves Hints empty.

- internal/server/rc_rbac.go: requestScopedChecker wraps Server.canRead
  with a per-request (verb,group,kind,namespace) memoization layer.
  Collapses the ~30 ref-checks per response into ~5 SAR calls without
  changing the underlying PermissionCache TTL.

- internal/server/ai_handlers.go: handleAIGetResource now returns
  { resource, resourceContext } by default. ?context=none keeps the
  pre-T6 bare shape. IssueSummary uses issues.ComposeWithStats scoped
  to the resource; AuditSummary uses pkg/audit.IndexByResource.
  handleAIListResources is unchanged — T8/T9 territory.

Golden tests cover Pod (full enrichment + RBAC denial + JSON shape),
Deployment (Flux HelmRelease label precedence), Service (Ingress
Exposes), NetworkPolicy (outgoing EdgeProtects intentionally not
surfaced), ConfigMap (owner-chain only), PolicyReports rollup, HPA
identity, and the EmitHints=false path. Hints determinism is pinned
by a separate golden in hints_test.go.
Reviewer caught that buildAIResourceContext never set
opts.PolicyReports, so policySummary.kyverno stayed absent on the REST
get-resource path even after T5 (index) and T11 (composer source). The
Build generator already supports it via PolicyReportLookup; the handler
just wasn't passing it.

Adds a narrow policyReportLookupAdapter that wraps
internal/k8s.GetPolicyReportIndex() — translating the richer
pkg/policyreports.Finding shape (Severity + Category) into the
agent-facing resourcecontext.KyvernoFinding shape (Policy / Rule /
Result / Message). Keeping the projection narrow at the adapter layer
means future additions to policyreports.Finding don't perturb the wire
contract.

When Kyverno isn't installed, GetPolicyReportIndex() returns nil and
opts.PolicyReports stays unset — Build emits no policySummary, which
is the correct degraded behavior.
Reviewer caught a plan-vs-code mismatch: the locked v1 contract says
basic tier emits Kyverno counts only (fail/warn/pass), with Top[]
findings reserved for diagnostic tier. The code emitted Top[] on both
tiers, inflating basic-tier wire size with details that belong on the
deeper agent investigation path.

Fix: buildPolicySummary now takes ContextTier. Basic emits counts only;
diagnostic adds the Top[] (cap 3, ordered fail > warn > error > pass).
Existing TestBuild_PolicyReports_RolledUp split into two cases that
pin both tier outputs.
… Pod hygiene fields (T23 dedup)

T23 (#720) made topology.Relationships the canonical projection for
server-synthesized ManagedBy (owner-chain + GitOps detection) and Pod
hygiene fields (ServiceAccount, Node). Drop the parallel owner-walk
and label/annotation scanning inside resourcecontext.Build — read
Relationships.{ManagedBy,ServiceAccount,Node} instead.

Build now calls topology.GetRelationshipsWithObject when Topology is
set, passing the fetched obj so kind/group disambiguation works
(Knative serving.knative.dev/Service vs core/v1 Service). Single-
resource callers (REST GET /api/ai/resources) pass idx=nil — the per-
call O(E) scan is fine for one walk per request; bulk callers should
pre-build via topology.IndexByResource(topo) and pass through
Options.RelIndex or Options.Relationships.

No deprecation aliases: buildManagedBy, parseArgoTrackingID,
pickControllerOwner, groupFromAPIVersion, readPair, and the
gitops/flux label constants are deleted outright (per saved feedback
preference). Their tests are removed — the same logic is exercised
in pkg/topology/managedby_test.go.
handleAIGetResource skipped the user-RBAC gates that handleGetResource
runs, so a user with namespace access but no per-namespace get-secrets
SAR could read Secret values via the AI endpoint, and a user without
cluster-scoped node SAR could read Node objects. The AI surface returns
the same resource bytes (just minified + wrapped) as the REST surface,
so it must enforce the same gates.

Extract the gate block into Server.preflightResourceGet and call it
from both handlers. Single helper keeps the two endpoints in lockstep
so future RBAC adjustments touch one place. Tests cover the three deny
arms (per-ns get-secrets, cluster-scoped get-node, namespace access)
plus a passing-case sanity check on the AI envelope shape.
…k + selectedBy mislabel

Reviewer's critical (#721): computeAuditSummaryForResource dropped the
Kind from its signature and matched only (namespace, name) by iterating
the audit index. A Deployment "web" in "prod" silently inherited audit
findings from a Service or ConfigMap of the same name in the same
namespace. The loop's map-iteration order also made TopFinding
non-deterministic across runs for tied severities — breaking the
determinism guarantee SynthesizeHints downstream advertises.

Fix derives the canonical kind from obj's TypeMeta in
buildAIResourceContext (Pascal singular — exactly what audit's check
runner writes into Finding.Kind) and threads it into the audit lookup.
The lookup is now a single map access via bpaudit.ResourceKey instead
of an O(n) scan. TopFinding selection sorts by (severity desc, CheckID
asc) so ties resolve identically every run.

Important fixes also addressed:

- RunsOn fallback gap (pkg/resourcecontext/build.go:197): the `else if
  rel == nil` guard meant pod.Spec.NodeName was used only when there
  was no topology at all. When topology was present but rel.Node was
  nil (Node informer cold, node not yet indexed), RunsOn stayed empty
  even though the Pod spec named a node. Now falls back any time
  rel.Node is missing.

- selectedByHint mislabel (pkg/resourcecontext/hints.go:124): every
  non-PDB ref was rendered as NetworkPolicy, including any future kind
  added to SelectedBy. Now explicit-match each known kind
  (PodDisruptionBudget, NetworkPolicy) and drop unrecognized kinds
  through summarizeKindsCounts instead of mislabeling.
… kind collisions

fetchAIResource was calling FetchResource (typed cache) before consulting
the group qualifier. For /api/ai/resources/services/ns/name?group=
serving.knative.dev, the typed cache returns the core/v1 Service and the
?group= is silently dropped, leaking the wrong object via the AI surface.

Branch on group != "" first and route directly to GetDynamicWithGroup,
mirroring handleGetResource's dispatch in server.go. The typed-first path
remains correct (and faster) when no group is passed.

Same bug class as T12's group-blind root lookup, on the single-resource
GET path.
…ch hits (T8+T9)

Add pkg/resourcecontext.BuildSummary — a tiny per-row enrichment helper that
returns ManagedBy + Health + IssueCount (≤ ~60 bytes/row). Wire it through:

- REST /api/ai/resources/{kind} list handler (handleAIListResources +
  aiListDynamic)
- MCP list_resources tool (handleListResources + listDynamicResources)
- REST /api/search executor (Hit.SummaryContext)
- MCP search tool (handleSearch)

Each handler builds the per-namespace issue index (via
internal/issues.Compose) and topology snapshot once per request, then
invokes BuildSummary per row. pkg/resourcecontext stays free of
internal/* and pkg/topology imports — callers pre-compute ManagedBy
via the new ManagedByFromOwner helper.

All four surfaces honor a context=none opt-out (query param for REST,
input arg for MCP) that returns bare rows.

Tests:

- Golden-file BuildSummary across Pod phases, Deployment replica states,
  NetworkPolicy (no health heuristic), CRD Ready/Available conditions,
  Health override, nil-object safety.
- ManagedByFromOwner source classification (argocd / flux / native).
- attach* helpers wired end-to-end with stub builders; defensive length-
  mismatch handling.
- managedByFromRelationships prefers Deployment grandparent shortcut over
  noisy ReplicaSet owner.
- Search executor invokes SummaryBuilder per kept hit when set, leaves
  SummaryContext nil when unset.
…+ shared index)

Now that T23 (#720) merged the server-synthesized Relationships.ManagedBy
chain and the RelationshipsIndex inverted-edges lookup, the per-row
SummaryContext builder no longer needs to re-detect ownership and no
longer pays O(E) per call.

- Read rel.ManagedBy[0] for the ManagedByRef (ArgoCD > Flux > Helm >
  topmost K8s owner). Falls back to rel.Owner when synthesis declines.
  Drops the local Deployment-grandparent shortcut — ManagedBy walks past
  the noisy hash-suffixed ReplicaSet for us.
- Build topology.IndexByResource(topo) once per request and thread it
  into GetRelationshipsWithObject on every row. Without the index, the
  list_resources + search hot paths were O(N x E).
- Use GetRelationshipsWithObject (not -WithIndex) so synthesis is
  group-aware when the row has the typed/unstructured object handy —
  avoids kind/plural collisions like Knative Service vs corev1 Service.
- Update server tests to pin the ManagedBy preference and the new
  Owner fallback shape.
…Helm classification

Three reviewer findings on PR #722:

1. issueIndexKey was group-blind: two CRDs sharing kind+ns+name in
   different API groups (Knative Service vs corev1 Service, two
   operators each shipping a "Cluster" CRD) collided on the same
   bucket and inherited each other's count. Threaded group through
   the builder + key — format is now "group|kind|ns|name", "|" is
   illegal in K8s API groups so it's a safe delimiter.

2. buildIssueIndex used Limit:MaxLimit (1000), so on >1000-issue
   clusters resources whose issues fell in the post-sort tail
   silently got issueCount=0. Picked Option B: added an explicit
   NoLimit sentinel in internal/issues. Fewer files touched than
   Option A (a separate CountByResource API), and the index is the
   only caller that wants the uncapped set — public /api/issues +
   issues_list keep their tight caps. ComposeStats.TotalMatched
   already reported the true total regardless of truncation.

3. sourceForOwner returned "native" for {Kind:"HelmRelease", Group:""}
   — but topology.detectManagedByFromMeta emits exactly that shape
   for native Helm installs (distinguished from Flux's HelmRelease
   CR which lives at helm.toolkit.fluxcd.io). Added a "helm" branch
   ahead of the group-based GitOps switch.

Tests:
- pkg/resourcecontext: native Helm classification.
- internal/server + internal/mcp: group-aware index (two CRDs same
  kind+ns+name), tail-of-MaxLimit overflow.

Signature change: summaryContextBuilder and search.SummaryBuilderFunc
now take `group` as a parameter. All callers updated to source group
from typed object GVK (SetTypeMeta path) or unstructured apiVersion.
…em.Group for built-ins

Extracts preflightResourceList as a shared helper between handleListResources
and handleAIListResources. The AI list path previously skipped the
cluster-scoped SAR (Node, PVs, cluster-scoped CRDs), the list-namespaces
SAR (kind=namespaces), and the per-namespace / cluster-wide list-secrets
SAR — bypassing the gates the REST list path enforces. Same gates now run
in the same order on both paths; REST keeps its 200-with-[] legacy shape
for denies, AI returns the explicit 403 so agents see the failure instead
of confusing "empty cluster" output.

Populates Problem.Group on all built-in DetectProblems sites (Deployment,
StatefulSet, DaemonSet → "apps"; HPA → "autoscaling"; Job, CronJob →
"batch"). The new group-aware summary_context index keys per-resource
issue counts as "group|kind|ns|name", so empty Group was zeroing
issueCount for every broken workload — a regression vs the pre-group-aware
behavior. CAPI sites already set Group, verified.

Tests:
- internal/server/ai_handlers_rbac_test.go: 403 on per-namespace secrets
  deny, cluster-scope nodes deny, list-namespaces deny; happy-path 200
  asserts summaryContext envelope.
- internal/k8s/problems_test.go: fake clientset with broken Deployment +
  StatefulSet + DaemonSet + maxed HPA + stuck Job, asserts each emitted
  Problem.Group matches its canonical API group.
…lter comment

Reviewer's important (#722): MCP list_resources and search build a full
topology snapshot on every invocation. REST reuses the broadcaster's
cached snapshot but MCP has no equivalent shared cache, so on a
multi-thousand-resource cluster every agent list/search paid a
multi-second topology build cost.

Fix: introduce a package-level topology.Memoizer (5s TTL — matches the
REST broadcaster's cadence) and route buildSummaryContextTopology
through it. Bursty agent traffic now amortizes the build cost over
many calls instead of paying it per request. Other MCP tools
(handleGetResource, get_neighborhood) still build inline; threading
them through is a separate follow-up.

Suggestion: also fix the misleading comment in
internal/server/summary_context.go that claimed "Sources are restricted
to 'problem' + 'condition'". Filters.Sources was never set — the
exclusion of audit/event sources comes from the false-by-default
IncludeAudit / IncludeEvents flags. Updated the comment to reference
the actual mechanism so future readers don't assume there's an explicit
Sources allowlist.
Three correctness/perf fixes on the summaryContext list path:

1. Cluster-scoped issues silently filtered out for cluster-scoped rows.
   handleAIListResources / handleListResources passed the user's
   namespaced-access set straight into the issue index. Cluster-scoped
   issues (Node, PV, cluster-scoped CRDs) live at namespace="" — the
   namespaced filter dropped them all, zeroing issueCount on every Node
   row even when the user had cluster-scoped Node access. Now: detect
   cluster-scoped kinds via k8s.ClassifyKindScope and pass nil for the
   issue index (cluster-wide compose). RBAC was already enforced upstream.

2. detectGenericCRDIssues scanned every watched CRD before kindFilter
   applied. A pods-only list_resources request still iterated every
   watched CRD GVR and called ListDynamic on each; applyFilters then
   discarded the rows. On clusters with hundreds of CRDs this dominated
   the per-row issue index build. Push f.Kinds awareness down — match
   the GVR's kind via p.KindForGVR (lowercase comparison mirrors
   applyFilters) and skip the ListDynamic call entirely for
   non-matching GVRs.

3. Trailing blank line at EOF in internal/mcp/summary_context.go.

Tests:
- internal/server: issueIndexNamespaces helper drops the namespace
  filter for cluster-scoped kinds and preserves it for namespaced kinds.
  buildIssueIndex end-to-end: a Node issue at namespace="" surfaces
  when the index is built without a namespace filter and disappears
  when a namespace slice is passed (matches CacheProvider's per-ns walk).
- internal/mcp: mirror of the server test against the MCP buildIssueIndex.
- internal/issues: countingProvider asserts detectGenericCRDIssues does
  NOT call ListDynamic for GVRs whose kind is filtered out, and DOES
  call it for matching GVRs and when Kinds is empty.

Fake provider's DetectProblems in both server + MCP tests updated to
mirror CacheProvider.flattenNamespacedProblems (drop cluster-scoped
rows under a namespace filter) so the regression test pins the actual
production behavior.
…ting

Two residual findings on the T89 work:

1. Search summaryContext dropped issueCount for cluster-scoped hits.
   handleSearch passed scanNamespaces (a strictly-namespaced filter)
   into the single issue index used for every hit. Search returns
   MIXED-kind hits (a query can return both namespaced Pods and
   cluster-scoped Nodes), and cluster-scoped problems live at
   namespace="" — the per-namespace filter dropped them, silently
   zeroing issueCount on every Node/PV/cluster-scoped-CRD hit.

   Fix: dual issue index per search request. namespacedIdx is scoped
   to scanNamespaces; clusterIdx is composed cluster-wide (nil filter)
   so namespace="" issues surface. The summaryContextBuilder closure
   dispatches per-hit via k8s.ClassifyKindScope(kind, group), so a
   single response correctly counts both Pod and Node issues.
   CanReadClusterScoped already gates which cluster-scoped kinds the
   user can see — the cluster-wide index doesn't expose unauthorized
   rows. Mirrored on the MCP search path.

2. AI list with group ignored group and tried typed cache first.
   handleAIListResources called k8s.FetchResourceList(cache, kind,
   namespaces) before consulting group, so kind=services&group=
   serving.knative.dev silently returned core Services. Same fix
   shape PR #721 applied to GET: group != "" short-circuits straight
   to aiListDynamic. Mirrored on MCP handleListResources.

Tests:
- server: TestSummaryContextBuilderFromIndexes_DispatchesByScope and
  TestNewSearchSummaryContextBuilder_BuildsDualIndex pin the per-hit
  scope-based dispatch and the dual-index construction shape.
- server: TestAI_ListServices_WithGroup_RoutesToDynamicCache pins the
  group routing on /api/ai/resources/{kind} (typed core Service for
  no group; dynamic-cache path for group=serving.knative.dev).
- mcp: same two builder tests plus
  TestHandleListResources_GroupRoutesToDynamic for the MCP list path.
…uncation

Two Bugbot findings on PR #722:

1. deriveHealth for Deployment and ReplicaSet compared ReadyReplicas against
   Status.Replicas (current pod count) instead of *Spec.Replicas (desired).
   During scale-down or rolling updates Status.Replicas can exceed Spec —
   surplus pods draining — and ReadyReplicas trails. The previous code
   reported "degraded" even when all DESIRED replicas were ready, which is
   the steady-state-healthy condition. StatefulSet/DaemonSet cases were
   already correct. Two new regression tests pin the scale-down scenario.

2. search.go::Search called SummaryBuilder inline inside buildHit for every
   matched candidate, BEFORE sort + truncate to opts.Limit. A broad query
   matching thousands of objects paid topology lookups for all of them,
   shipped at most Limit=50. Refactored to buffer hits + their source obj
   in a private pendingHit slice, sort and truncate that, then run
   SummaryBuilder only on the kept hits. The deferred topology lookups
   now scale with output size, not match size.
After the deferred-summary refactor, every buildHit call site passes nil
for summaryBuilder; the actual attachment happens in Search's post-
truncation loop. The parameter + the `if summaryBuilder != nil` branch
inside buildHit are dead code and a maintenance trap — a future caller
passing non-nil would defeat the post-truncation optimization the
refactor exists to enable.

Drop the parameter, remove the inner branch, and tighten the doc to
explicitly say SummaryContext attachment is Search's responsibility
after sort + Limit truncation. No behavior change; existing tests cover
both code paths.
Commit 9ba47cb landed a partial rename of SummaryContext →
ResourceSummaryContext and ManagedByRef.Ref → ResourceSummaryRef in
summary.go / summary_test.go / search.go, but types.go still defines
the flat SummaryContext + ManagedByRef shape that every consumer in
internal/server, internal/mcp, internal/search, and pkg/ai/context
references. The half-applied rename left the tree in a broken-build
state — `go build ./...` failed with `undefined: ResourceSummaryContext`
and `unknown field Ref in struct literal of type ManagedByRef`.

Restore the original SummaryContext / ManagedByRef shape in summary.go
+ search.go + the summary_test.go golden. Spec.Replicas health fix from
9ba47cb stays.
The REST and MCP SummaryContext builders carried ~200 LoC of duplicated
implementation — the issue-index map, group-aware key arithmetic,
canonicalSingular kind normalization, BuildIssueIndex compose loop,
managedByFromRelationships extraction, and the per-row dispatch
closure body that picks namespaced vs cluster-wide index by scope. The
only substantive difference was the topology source (REST: broadcaster
cache; MCP: 5s memoizer).

Lift the shared core into a new internal/summarycontext package that
exports:
- Builder (the per-request closure type)
- IssueIndex with Count(group, kind, ns, name) int
- CanonicalSingular(kind) string
- BuildIssueIndex(provider, namespaces, kindFilter) IssueIndex
- ManagedByFromRelationships(rel) *ManagedByRef
- BuilderFromIndexes(topo, namespacedIdx, clusterIdx) Builder

BuilderFromIndexes takes the topology snapshot as an argument so REST
and MCP keep their respective sources unchanged — REST passes
s.broadcaster.GetCachedTopology(), MCP passes the memoized build.

internal/server/summary_context.go and internal/mcp/summary_context.go
shrink to thin wrappers that own their topology source and forward
everything else to the shared package. The MCP file additionally keeps
its local summaryCtxTopoMemo + buildSummaryContextTopology helpers
(MCP-specific, no analogue on the REST side).

Pure-function tests (issue-index key arithmetic, BuildIssueIndex over a
fake provider, CanonicalSingular, ManagedByFromRelationships, the
dual-index dispatch shape) move into the new package alongside the
shared code. The REST-side wiring tests (attachSummaryContextToList
end-to-end, issueIndexNamespaces dispatch) stay in internal/server
since they exercise handler plumbing. The MCP-side test file was
pure-function-only and is deleted outright — coverage now lives in
internal/summarycontext.

Net: ~390 LoC dropped from the tree; no wire-shape or behavior changes.
…text

Telegraph the relationship to ResourceContext: this is the row-tier
companion of the detail-tier ResourceContext. The previous "SummaryContext"
name implied a different concept; the new name makes the role explicit
(row-tier enrichment on lists + search hits).

ManagedByRef stays flat — only the type rename plus the field reference
chase across server / mcp / search / summarycontext / ai-context callers.
Wire JSON tag is unchanged ("summaryContext") so external consumers see
no shape difference.
groupFromObject, groupFromUnstructured, and the two
attachResourceSummaryContext* helpers were byte-identical between
internal/server/ai_handlers.go and internal/mcp/tools.go. The
internal/summarycontext package exists explicitly to centralize
shared per-row builder logic — moved all four there as
GroupFromObject, GroupFromUnstructured, AttachToTypedList, and
AttachToUnstructuredList.

Eliminates the maintenance hazard of two copies drifting. REST and
MCP wrappers now share one source of truth alongside the existing
Builder type.
…ount

BuildIssueIndex used to set filters.Kinds = [CanonicalSingular(kindFilter)]
on the Compose call. CanonicalSingular only knows built-in plurals, so
a CRD listed by its plural form (e.g. ArgoCD "applications") fell
through unchanged. Compose's case-insensitive Kind filter then failed
against the singular "Application" the issue engine emits, and every
CRD row's issueCount silently became 0. The MCP tool description
encourages plural forms ("e.g. pods, deployments"), so this hit the
common path for agents inspecting CRDs.

Fix: drop the Kinds filter entirely. The per-row bucketing in
issueIndexKey(group, kind, ns, name) already discriminates correctly
because the lookup side runs the row's singular Kind through
CanonicalSingular too — index and query agree without needing a
Compose-time filter. Per-namespace issue volumes are tiny (~tens to
low hundreds), so the extra bucket work is negligible.

Removed `kindFilter` from BuildIssueIndex's signature and from both
newResourceSummaryContextBuilder wrappers (REST + MCP) since neither
needs it anymore. Added TestBuildIssueIndex_CRDPlural_NonZeroCount to
pin the contract.
…lationship lookup

Two residual bugs on the AI GET resourceContext path were dropping signal
silently:

1) computeIssueSummaryForResource was called with the URL-plural kind
   ("deployments"), which the issues composer's Filters.Kinds matcher
   case-folds but does NOT plural-to-singular convert. Issue.Kind is the
   canonical Pascal singular ("Deployment"), so every issue got filtered
   out and IssueSummary.Count silently collapsed to 0 (Build then omits
   the field entirely). Fix: pass canonicalKind (derived from obj.GVK)
   into the function — same pattern computeAuditSummaryForResource was
   already using.

2) GetRelationshipsWithObject was called with the URL-plural kind, which
   buildNodeID resolved to the wrong topology node for cross-group CRDs
   whose Kind collides with a core kind. For a Knative serving.knative.dev
   Service request, "services" → "service/ns/name" picked the CORE
   Service node instead of "knativeservice/ns/name", so relationship
   walks returned the core Service's edges. Same shape for CAPI Cluster
   ("capicluster") and Istio Gateway ("istiogateway").

   Added topology.KindForGVK as a small exported helper that maps
   (kind, group) → pseudo-kind for the three known cross-group
   collisions; the handler now funnels gvk through it before the
   relationship lookup. Non-colliding kinds (core, apps, batch, Gateway
   API, etc.) pass through unchanged so buildNodeID's existing kindMap
   handles them.

Tests:
- pkg/topology/pseudokinds_test.go: table tests covering every remap
  case plus the pass-throughs (including the wrong-group-same-kind
  guards: Service under argoproj.io, Route under route.openshift.io).
- internal/server/ai_handlers_group_test.go:
  * TestAIGetResource_GroupRoutesRelationshipsToKnative — seeds a
    Knative Service in the dynamic cache co-named with the core Service
    plus an Ingress backend-ref'd to the core Service. Asserts the
    knative-routed response does NOT leak the core Service's Ingress
    into resourceContext.exposes (locked the regression to fail
    without the KindForGVK funnel).
  * TestAIGetResource_IssueSummaryCountsURLPluralKind — asserts a
    broken Deployment (UnavailableReplicas=3) surfaces with count > 0
    in resourceContext.issueSummary when fetched via the URL plural.

Fixture additions in TestMain:
- broken/stuck-app Deployment (UnavailableReplicas=3): seeds a known
  problem for the issue-summary regression test, in its own namespace
  so it doesn't perturb default-namespace smoke tests.
- default/nginx-ingress Ingress routing to Service "nginx": the
  distinguishing edge between core Service and Knative Service
  topology nodes.
Removes the prose `Hints []string` field, the `EmitHints` option, the
SynthesizeHints generator, and all associated tests. ~430 LoC of net
removal.

Rationale: our dominant agent consumer (Claude-class) composes triage
prose from the structured fields trivially. The pre-baked Hints array
added wire bytes + prompt tokens with no net signal for that consumer.
Once shipped, agents pattern-matching on Hint substrings would have
ossified the wording. The structured fields (ManagedBy, Exposes,
SelectedBy, Uses, RunsOn, ScaledBy, IssueSummary, AuditSummary,
PolicySummary, Omitted) carry every fact a derived prose line would
encode — agents that need narrative can compose it.

If a real consumer emerges that needs deterministic prose, add it as a
separate `explain_resource` MCP tool. Keeping it inline was a premature
bet against asymmetric costs: easy to add later, hard to remove or
evolve once agents depend on the strings.

The doc comment on ResourceContext records the decision so future
readers don't re-introduce the projection without revisiting the
tradeoff.
Radar is OpenAPI-first per CLAUDE.md, but that discipline was adopted
to keep the Go backend and the TypeScript SPA client in sync — one
spec, regenerated as server stubs + TS client. /api/ai/* has no SPA
consumer; agents read MCP tool descriptions or in-prompt instructions,
not OpenAPI specs.

Without this doc, a future maintainer might reflexively bring
/api/ai/* under openapi.yaml on the assumption that radar's
OpenAPI-first stance covers every endpoint. That would pay the
spec-authoring tax during agent-surface evolution without earning
the SDK-generation benefit. The wave-2 wire shapes already evolved
across three review rounds; locking them down in YAML during that
churn would have produced spec/code drift, not contract clarity.

Top-of-file doc records the intentional opt-out + the revisit
triggers (surface stability + public-SDK commitment). Future
readers see explicit reasoning instead of accidental omission.
…, 500 on unknown

Three Bugbot findings on PR #721:

1. Build's relationship fallback used `ident.Kind` directly (the raw GVK
   kind like "Service") when opts.Relationships was nil but opts.Topology
   was set. For Knative serving.knative.dev/Service this resolved to the
   CORE Service topology node, leaking core relationships into the CRD's
   resourceContext — defeating the KindForGVK fix that the handler-side
   pre-computation already applies. Mirror the handler's resolution here
   so the fallback agrees with the primary path.

2. buildUsesFromPod had the dominant-pattern Reason labels reversed:
   ConfigMaps got ReasonEnvVarRef, Secrets got ReasonVolumeMount.
   ConfigMaps surface primarily via volume mounts (config files);
   Secrets primarily via env (SecretKeyRef). Swapped to match the
   common case. Doc comment clarifies the label is best-effort per
   kind — both paths feed both kinds in practice.

3. writeAIFetchError default branch returned 404 NOT_FOUND for any
   unrecognized error. Unknown errors are server-side (e.g. "resource
   discovery not initialized") and should surface as 500, not be
   masked as missing-resource. One-line fix.

No test changes — the fixes are correctness-only and existing tests
remain valid.
Remove fields and enum values with no consumer in v1 and no concrete
plan to populate them in T7/T10/T13:

- ContextRef.Reason (RefReason enum, 10 values): structurally redundant
  with the parent field name (selectedBy → selector match, runsOn →
  node binding, etc.). The ConfigMap/Secret refs were tagged with a
  single Reason that was inaccurate for the env-from vs volume-mount
  half of each set — actively misleading on the wire.
- ContextRef.Source (RefSource enum, 5 values): internal provenance
  describing which radar subsystem produced the fact. No agent or UI
  consumer branches on it; planned policy_report/audit_engine sources
  never get a ContextRef anyway (T10 emits PolicySummary/AuditSummary
  rollups).
- ContextRef.Confidence: reserved field, never set. Defer until fuzzy
  joins (Trivy/ConfigAudit) actually land.
- Options.MaxTokens: reserved field, never enforced. Add when there's
  budgeting code.
- ResourceContext.Truncated: never set by Build — no truncation path
  exists in the generator (RBAC drops are reported via Omitted).
- OmittedKindUnsupported + OmittedProviderDisabled: unused enum values.
  Kept rbac_denied / budget_exceeded / cache_cold / not_installed —
  all four are populated today or planned for T10's policy-report
  diagnostic per the TODO in internal/k8s/policy_reports.go.
- Drop the handler-side pre-compute of Relationships in
  buildAIResourceContext. Build's own fallback already does the
  identical KindForGVK pseudo-kind resolution; pre-computing here
  doubled the work whenever the lookup returned nil (handler call
  returned nil → opts.Relationships=nil → Build's `rel==nil &&
  opts.Topology!=nil` branch re-ran the same scan).

- Sort issue-summary rows by (severity desc, Reason asc) before
  picking topReason. The sibling computeAuditSummaryForResource
  already sorts (severity desc, CheckID asc) for determinism;
  the issue path's iteration-order pick could vary across runs
  when multiple rows tie on severity.
…ped lookups

computeAuditSummaryForResource unconditionally passed []string{namespace}
to audit.RunFromCache, even when namespace=="" (cluster-scoped
resources). That filters audit to literally namespace="" resources
instead of scanning all namespaces. computeIssueSummaryForResource
already guards with `if namespace != ""` — match it.

Latent today since the audit suite doesn't currently cover
cluster-scoped kinds, but the inconsistency would silently miss
findings the moment a cluster-scoped check lands.
Two T6 fixes required after rebasing onto post-#723 (Kyverno) and
post-#726 (RBAC reverse-lookup + Crossplane) main:

- T11 (#723) threaded API group through the PolicyReport lookup,
  changing policyreports.Index.FindingsFor from (kind, ns, name) to
  (group, kind, ns, name). Updated resourcecontext.PolicyReportLookup
  interface, the build.go call site (passing ident.Group), the
  policyReportLookupAdapter in ai_handlers.go, and the test mock.
  Group threading is a strict improvement — two CRDs sharing
  kind+ns+name across API groups now get disjoint findings instead
  of one inheriting the other's.

- internal/server/server_smoke_test.go acquired a `rbacv1` import on
  main (#726) at the same line our `networkingv1` import lands on
  this branch. Conflict resolution: keep both, sorted.
…vider

T11 (#723) added KyvernoFindings + KyvernoStatus to the issues.Provider
interface so the composer can route PolicyReport findings into the
unified issue stream and surface index-lifecycle status. Our test
fake didn't implement either, so summarycontext tests stopped
compiling after the rebase onto post-T11 main. Returning nil/"" is
correct for these tests: BuildIssueIndex doesn't read Kyverno
findings (kindFilter was dropped in the prior commit; the index
buckets problem+condition only), and KyvernoStatus is consumed by
the issues meta block — not by the index path under test.
Cross-group kind collisions (e.g. core/Service vs serving.knative.dev/Service
sharing namespace + name) previously caused audit/issue summaries
to silently inherit findings from the wrong resource. The fetch and
relationship paths in T6 were already group-aware; the summary
lookups were not.

pkg/audit:
- Finding and ResourceGroup gain a Group field (omitempty on the wire).
- ResourceKey signature changes to (group, kind, ns, name) — encoded
  "group|Kind|ns|name" with "|" delimiter matching the issue-source
  key in internal/summarycontext.
- Exported GroupForBuiltinKind(kind) — single source of truth for the
  built-in (Kind→Group) map. buildResults populates Finding.Group via
  this helper so the 20 per-check emission sites stay terse.

internal/server:
- buildAIResourceContext threads canonicalGroup from obj GVK through
  to both computeIssueSummaryForResource and computeAuditSummaryForResource.
- Audit lookup uses the new group-aware ResourceKey.
- Issue summary adds a strict row.Group == group check inside the
  match loop. The previous kind-only filter could silently pull in
  a colliding core kind's issues for a CRD lookup.
- UI audit drill-down (handleResourceAudit) explicitly passes group=""
  since the URL doesn't carry group today — comment points to #35
  for the CRD-aware drill-down work.

internal/issues:
- Centralised Group resolution via resolveGroup(group, kind): use the
  explicit group if set, else fall back to audit.GroupForBuiltinKind.
- Applied at fromProblem (legacy k8s.DetectProblems sites still emit
  Group="" for built-in workloads), fromAudit (passes through
  audit.Finding.Group which buildResults populates), fromWarningEvent
  (split apiVersion → group, with "v1"-shape → core).

Pin tests:
- TestResourceKey_GroupAware: distinct keys across groups.
- TestIndexByResource_NoCrossGroupCollision: lookup per-group returns
  only its own findings.
- TestGroupForBuiltinKind: the table.

Closes #35.
… the wire

auditSummary.highestSeverity used to emit raw audit vocabulary
("danger" / "warning") while sibling issueSummary.highestSeverity
emits issue vocabulary ("critical" / "warning"). Two fields under the
same resourceContext disagreeing on what "highest severity" means is
a real wire-shape footgun — consumers parsing one will mis-handle the
other.

Mirrors the same mapping internal/issues.fromAudit applies when audit
findings flow through the unified issue stream, so the two paths now
agree. Empty / future audit severities pass through unchanged so the
contract stays explicit if pkg/audit grows new values.

Pinned by TestNormalizeAuditSeverity.
…ption fixes

Bundle of bench-validated agent-surface improvements developed and tested
on this combined branch. Not yet split into per-task PRs — see the wave-3
plan for the eventual split.

T7 — MCP get_resource default-on resourceContext
- internal/mcp/rc_rbac.go: request-scoped RefAccessChecker adapter
- internal/mcp/resource_context.go: MCP-side IssueSummary/AuditSummary/
  PolicyReport composers + topology lookup
- internal/mcp/tools.go: getResourceInput.Context, group-first dispatch,
  Build wrapping into {resource, resourceContext} envelope

resourcecontext additions (T7-supporting + sibling)
- ServiceBackends lookup interface (resolve Service selector → ready Pods)
- PodSummary, WorkloadSummary, StatusSummary projections
- Owner ref handling + audit severity normalization fallthroughs
- Tests across build.go and types.go

Search content-field matching (B5 win driver)
- internal/search/score.go: ContentField type, content-exact/substring
  scoring, MatchSnippet generation with bounded excerpt size
- internal/search/candidate.go: candidate match assembly
- internal/search/types.go: MatchSnippet struct + content:path site
- Hooked into ConfigMap/Secret data, workload env refs, CRD spec fields

Log pipe primitives (kubectl-arm parity)
- pkg/ai/context/logs.go: FilterLogsByPattern (server-side grep)
- internal/mcp/tools.go + tools_workloads.go: get_pod_logs +
  get_workload_logs accept grep, since (kubectl --since), previous
  (kubectl -p for CrashLoopBackOff diagnosis)
- parseLogsSince helper with full unit tests

Tool description tightening (descriptions are the highest-leverage knob)
- tools_audit.go: "compliant resources are not returned" — addresses
  the B7 audit false-positive vector
- tools_helm.go, tools_neighborhood.go, tools_rbac.go, tools_apply.go,
  tools_workloads.go: minor description fixes
@nadaverell nadaverell requested a review from hisco as a code owner May 21, 2026 00:02
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3d33d73. Configure here.

}
filters := issues.Filters{
Kinds: []string{kind},
Limit: issues.MaxLimit,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue summary uses MaxLimit instead of NoLimit

Medium Severity

Both computeMCPIssueSummary and computeIssueSummaryForResource pass issues.MaxLimit (1000) as the Filters.Limit, but the newly added NoLimit sentinel and its documentation explicitly state that per-resource issue indexes need NoLimit to avoid silently dropping counts on clusters with more than 1000 issues. The sibling BuildIssueIndex in internal/summarycontext correctly uses issues.NoLimit. Using MaxLimit here means that on large clusters, issues beyond the 1000th sorted row are truncated before the per-resource name/group filter runs, silently zeroing issueSummary for affected resources.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3d33d73. Configure here.

Comment thread internal/mcp/tools.go
filtered, err := aicontext.FilterLogsByPattern(string(data), input.Grep)
if err != nil {
return nil, nil, fmt.Errorf("invalid grep regex: %w", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate grep regex validation in pod logs handler

Low Severity

In handleGetPodLogs, the grep regex is compiled and validated at line 1349 (regexp.Compile), then FilterLogsByPattern compiles it again at line 1378 and the error is caught a second time. The first validation is redundant since FilterLogsByPattern already returns the compile error. The workload logs handler (handleGetWorkloadLogs) has the same pattern but at least doesn't re-check the error from FilterLogsByPattern (it discards it with _). The inconsistency between the two handlers is confusing.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3d33d73. Configure here.


// Apply AI-optimized log filtering
entry.Logs = aicontext.FilterLogs(string(data))
filtered, _ := aicontext.FilterLogsByPattern(string(data), input.Grep)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workload logs silently discards FilterLogsByPattern error

Low Severity

In handleGetWorkloadLogs, the error from aicontext.FilterLogsByPattern is discarded with _ on line 292. While the grep regex is pre-validated earlier (line 211), this differs from handleGetPodLogs which checks the error from FilterLogsByPattern. If the pre-validation were ever removed or bypassed, invalid regex errors would be silently swallowed, returning empty/incorrect filtered logs instead of an error.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3d33d73. Configure here.

Comment thread internal/server/server.go Outdated
Comment thread internal/server/server.go Outdated
Comment thread internal/server/server.go Outdated
Comment thread internal/server/server.go Outdated
Comment thread internal/server/server.go Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants